KMID : 1132720200180030033

Genomics & Informatics
2020 Volume.18 No. 3 p.33 ~ p.33

Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0

Kim Sun-Ho

Kim Ro-Young
Nam Hee-Jo
Kim Ryeo-Gyeong
Ko En-Jin
Kim Han-Su
Shin Ji-Hye
Cho Da-Eun
Jin Yu-Rhee
Bae So-Yeon
Jo Ye-Won
Jeong San-Ah
Kim Ye-Na
Ahn Seo-Yeon
Jang Bo-Mi
Seong Ji-Heyon
Lee Yu-Jin
Seo Si-Eun
Kim Yu-Jin
Kim Ha-Jeong
Kim Hye-Ji
Sung Hye-Lynn
Lho Hyo-Young
Koo Jay-Won
Chu Ji-On
Lim Ju-Won
Kim Young-Ju
Lee Kyung-Yeon
Lim Yu-Ri
Kim Meong-Eun
Hwang Seon-Jeong
Han Shin-Hye
Bae So-Hyeun
Kim Su-A
Yoo Su-Hyeon
Seo Yeon-Jeong
Shin Ye-Rim
Kim Yon-Soo
Ko You-Jung
Baek Ji-Hee
Hyun Hye-Jin
Choi Hye-Min
Oh Ji-Hye
Kim Da-Young
Park Hyun-Seok

Abstract

This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition (OCR) errors during the first Genomics & Informatics Annotation Hackathon (GIAH). Extracting text from multi-column biomedical documents such as Genomics & Informatics is notoriously difficult. The hackathon was piloted as part of a coding competition of the ELTEC College of Engineering at Ewha Womans University. Its aims were to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and at the same time to acquire tangible and transferable skills. The projects proposed during the hackathon draw on an internal database containing different versions of the corpus and annotations.

KEYWORD

biomedical text mining, corpus, text analytics
